We are introduced to the main functions we’ll need
We go over some tidy workflows to clarify the concepts
Setup
For these slides we will only need the tidyverse package.
library(tidyverse) # All-in-one
tidy data
But what exactly are tidy data? It is not just a buzzword; there is a real definition. In three parts, to be exact. Taken from Hadley Wickham’s R for Data Science:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
Following three rules makes a dataset tidy — variables are in columns, observations are in rows, and values are in cells. Reproduced from R for Data Science
One will generally satisfy these three rules effortlessly by never putting more than one dataset in a file, and never putting more (or fewer) than one variable in the same column. We will go over this several more times today, so do not fret if these guidelines are not immediately clear.
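As a minimal sketch of the three rules (the sites, dates, and temperatures here are invented for illustration, not the course data), a tidy dataframe looks like this:

```r
library(tidyverse)

# One variable per column (site, t, temp); one observation per row;
# one value per cell. All values below are made up.
sst <- tibble(
  site = c("A", "A", "B", "B"),
  t    = as.Date(c("2008-01-01", "2008-01-02",
                   "2008-01-01", "2008-01-02")),
  temp = c(21.3, 21.5, 14.2, 14.1)
)
sst
```

Each row is one observation (a temperature at one site on one day), so nothing needs to be reshaped before plotting or summarising.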
Tidy workflow
The tidying of one’s data should be the second step in any workflow, after the loading of the data.
Data tidying in the data processing pipeline. Reproduced from R for Data Science
The basics
To illustrate how this three-part definition is implemented in a workflow, we are going to learn how to manipulate a non-tidy dataset into a tidy one
We will use the NOAA OISST data for the three time series from 2008 - 2009, with some notable (untidy) changes
The secondary purpose of these exercises is to illustrate that these steps may be done more quickly in R than in Excel
Thereby allowing one’s raw data to remain exactly as they were collected, with all of the manipulations performed on them documented in an R script
This is a centrally important part of reproducible research.
load data
Note that this loads five different objects
Each object contains the same data in different states of disrepair
Some are easier to tidy than others
load("../data/OISST_mangled.RData")
OISST1
OISST2
OISST3
# Spread across two dataframes
OISST4a
OISST4b
OISST1
If these data look like the previous data we’ve used thus far, that’s because they are
These are how tidy data should look. No surprises.
In fact, because these data are already tidy it is very straightforward to use them for whatever purposes we may want
Making a time series plot, for example
ggplot(data = OISST1, aes(x = t, y = temp)) +
  geom_line(aes(colour = site)) +
  labs(x = "", y = "Temperature (°C)", colour = "Site") +
  theme_bw()
%>%
Remember that this funny series of symbols is the pipe operator. It chains consecutive lines of code together so that they run as though they were one ‘chunk’. We will be seeing this symbol a lot today. The keyboard shortcut for %>% is Ctrl+Shift+M.
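As a small illustration (with made-up numbers, not the course data), the pipe rewrites nested function calls as left-to-right steps:

```r
library(tidyverse) # provides %>%

# Invented temperatures for illustration only
temps <- c(21.3, 21.5, 14.2, 14.1)

# Nested form: read from the inside out
round(mean(temps), 1)

# Piped form: each result flows into the next function
temps %>%
  mean() %>%
  round(1)
```

Both forms give the same answer; the piped version simply reads in the order the operations happen.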
pivoting
Before ‘tidy’ became the adjective used to describe neatly formatted data, people used to say ‘long’. This is because well organised dataframes tend to be longer than they are wide (with the exception of species assemblage data). The opposite of long data are wide data. If one ever finds a dataset that is wider than it is long, it is probably because the person who created it spread one variable across many columns.
As we sit here and read through these examples it may seem odd that so much effort is being spent on something as straightforward as tidy data. Surely this is too obvious to devote an entire day of work to it? Unfortunately not. As we go out into the wild world of ‘wild data’, we tend to find that very few datasets (especially those collected by hand) are tidy. Rather, they are plagued by any number of issues. The first step in tidying up the data is therefore to have a look at them and discern which observations were made/recorded, and what the variables within those observations are. Let’s have a look now at OISST2 for an example of what wide data look like, and how to fix them.
When data are too wide
In OISST2 we can see that the site column has been removed
Temperatures are placed in columns that denote the site
This may seem like a reasonable organisation, but it is not tidy because the site is a single variable and so should not take up more than one column
We need to pivot_longer() these site columns back together
We do this by telling pivot_longer() what the names of the columns are we want to squish together
We then tell it the name of the identification and value columns
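The steps above can be sketched with a small mock dataset (the column names site_A and site_B are invented here; OISST2’s actual site columns will differ):

```r
library(tidyverse)

# Wide (untidy) mock data: one temperature column per site
sst_wide <- tibble(
  t      = as.Date(c("2008-01-01", "2008-01-02")),
  site_A = c(21.3, 21.5),
  site_B = c(14.2, 14.1)
)

# Squish the site columns together, naming the
# identification column 'site' and the value column 'temp'
sst_long <- sst_wide %>%
  pivot_longer(cols = c(site_A, site_B),
               names_to = "site", values_to = "temp")
sst_long
```

The two site columns collapse into a single `site` column, with their values stacked in `temp`, doubling the number of rows.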
We’ve now covered how to make our dataframes longer or wider depending on their tidiness. Now we will look at how to manage our columns when they contain more (or less) than one variable, but the overall dataframe does not need to be made wider or longer. This is generally the case when one has a column with two variables, or two or more variables are spread out across multiple columns, but there is still only one observation per row. Let’s see some examples to make this clearer.
Separate
OISST4a has its site and t columns replaced by a single index column
This is an efficient way to store these data, but it is not tidy because the site and date of each observation are separate variables
To re-create our site and t columns separate() the index column
Because we are creating new column names we tell these to R within inverted commas
Lastly we should tell R how to separate the index column
Oftentimes the separate() function is able to guess correctly, but it is better to be explicit
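A sketch of this with mock data (the real format and separator of OISST4a’s index column may differ):

```r
library(tidyverse)

# Untidy mock data: site and date fused into a single 'index' column
sst_index <- tibble(
  index = c("A/2008-01-01", "A/2008-01-02"),
  temp  = c(21.3, 21.5)
)

# Split 'index' into new columns, quoting the new column names
# and being explicit about the separator
sst_sep <- sst_index %>%
  separate(col = index, into = c("site", "t"), sep = "/")
sst_sep
```

separate() replaces the index column in place, so the result has `site`, `t`, and `temp` columns.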
left_join() will often guess what we want to do and provide a message in the Console
But it is better to be explicit
OISST4_tidy <- left_join(OISST4a_tidy, OISST4b_tidy, by = c("site", "t"))
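To see what left_join() is doing, here is a self-contained sketch with two mock halves of a dataset (these are not the actual contents of OISST4a and OISST4b):

```r
library(tidyverse)

# Two tidy halves of the same observations, sharing 'site' and 't'
temps <- tibble(site = c("A", "B"),
                t    = as.Date("2008-01-01"),
                temp = c(21.3, 14.2))
extras <- tibble(site = c("A", "B"),
                 t    = as.Date("2008-01-01"),
                 note = c("x", "y"))

# Being explicit with 'by' avoids the guessing message in the Console
joined <- left_join(temps, extras, by = c("site", "t"))
joined
```

Each row of the first dataframe is matched to the second on the shared `site` and `t` columns, so the remaining columns from both end up side by side.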
But why though?
At this point one may be wondering what the point of all of this is. Sure it’s all well and good to see how to tidy one’s data in R, but couldn’t this be done more quickly and easily in Excel? Perhaps, yes, with a small dataset. But remember, (for many) the main reason we are learning R is to ensure that we are performing reproducible research. This means that every step in our workflow must be documented. And we accomplish this by writing R scripts.